Empirical Bayes Matrix Factorization

Matrix factorization methods, which include Factor analysis (FA) and Principal Components Analysis (PCA), are widely used for inferring and summarizing structure in multivariate data. Many such methods use a penalty or prior distribution to achieve sparse representations (“Sparse FA/PCA”), and a key question is how much sparsity to induce. Here we introduce a general Empirical Bayes approach to matrix factorization (EBMF), whose key feature is that it estimates the appropriate amount of sparsity by estimating prior distributions from the observed data. The approach is very flexible: it allows for a wide range of different prior families and allows that each component of the matrix factorization may exhibit a different amount of sparsity. The key to this flexibility is the use of a variational approximation, which we show effectively reduces fitting the EBMF model to solving a simpler problem, the so-called “normal means” problem. We demonstrate the benefits of EBMF with sparse priors through both numerical comparisons with competing methods and through analysis of data from the GTEx (Genotype Tissue Expression) project on genetic associations across 44 human tissues. In numerical comparisons EBMF often provides more accurate inferences than other methods. In the GTEx data, EBMF identifies interpretable structure that agrees with known relationships among human tissues. Software implementing our approach is available at https://github.com/stephenslab/flashr.


Introduction
Matrix factorization methods are widely used for inferring and summarizing structure in multivariate data.In brief, these methods represent an observed n × p data matrix Y as: where L is an K × n matrix, F is a K × p matrix, and E is an n × p matrix of residuals (whose entries we assume to be normally distributed, although the methods we develop can be generalized to other settings; see Section 6.4).Here we adopt the notation and terminology of factor analysis, and refer to L as the "loadings" and F as the "factors".
The model (1.1) has many potential applications.One range of applications arise from "matrix completion" problems (e.g.Fithian et al., 2018): methods that estimate L and F in (1.1) from partially observed Y provide a natural and effective way to fill in the missing entries.Another wide range of applications come from the desire to summarize and understand the structure in a matrix Y : in (1.1) each row of Y is approximated by a linear combination of underlying "factors" (rows of), which-ideally-have some intuitive or scientific interpretation.For example, suppose Y ij represents the rating of a user i for a movie j.Each factor might represent a genre of movie ("comedy", "drama", "romance", "horror" etc), and the ratings for a user i could be written as a linear combination of these factors, with the weights (loadings) representing how much individual i likes that genre.Or, suppose Y ij represents the expression of gene j in sample i.Each factor might represent a module of co-regulated genes, and the data for sample i could be written as a linear combination of these factors, with the loadings representing how active each module is in each sample.Many other examples could be given across many fields, including psychology (Ford et al., 1986), econometrics (Bai and Ng, 2008), natural language processing (Bouchard et al., 2015), population genetics (Engelhardt and Stephens, 2010), and functional genomics (Stein-O'Brien et al., 2018).
The simplest approaches to estimating L and/or F in (1.1) are based on maximum likelihood or least squares.For example, Principal Components Analysis (PCA)-or, more precisely, truncated Singular Value Decomposition (SVD)-can be interpreted as fitting (1.1) by least squares, assuming that columns of L are orthogonal and columns of F are orthonormal (Eckart and Young, 1936).And classical factor analysis (FA) corresponds to maximum likelihood estimation of L, assuming that the elements of F are independent standard normal and allowing different residual variances for each column of Y (Rubin and Thayer, 1982).While these simple methods remain widely used, in the last two decades researchers have focused considerable attention on obtaining more accurate and/or more interpretable estimates, either by imposing additional constraints (e.g.non-negativity; Lee and Seung, 1999) or by regularization using a penalty term (e.g.Jolliffe et al., 2003;Witten et al., 2009;Mazumder et al., 2010;Hastie et al., 2015;Fithian et al., 2018), or a prior distribution (e.g.Bishop, 1999;Attias, 1999;Ghahramani and Beal, 2000;West, 2003).In particular, many authors have noted the benefits of sparsity assumptions on L and/or F -particularly in applications where interpretability of the estimates is desired-and there now exists a wide range of methods that attempt to induce sparsity in these models (e.g.Sabatti and James, 2005;Zou et al., 2006;Pournara and Wernisch, 2007;Carvalho et al., 2008;Witten et al., 2009;Engelhardt and Stephens, 2010;Knowles and Ghahramani, 2011;Bhattacharya and Dunson, 2011;Mayrink et al., 2013;Yang et al., 2014;Gao et al., 2016;Hore et al., 2016;Ročková and George, 2016;Srivastava et al., 2017;Kaufmann and Schumacher, 2017;Frühwirth-Schnatter and Lopes, 2018;Zhao et al., 2018).Many of these methods induce sparsity in the loadings only, although some induce sparsity in both loadings and factors.
In any statistical problem involving sparsity, a key question is how strong the sparsity should be.In penalty-based methods this is controlled by the strength and form of the penalty, whereas in Bayesian methods it is controlled by the prior distributions.In this paper we take an Empirical Bayes approach to this problem, exploiting variational approximation methods (Blei et al., 2017) to obtain simple algorithms that jointly estimate the prior distributions for both loadings and factors, as well as the loadings and factors themselves.
Both EB and variational methods have been previously used for this problem (Bishop, 1999;Lim and Teh, 2007;Raiko et al., 2007;Stegle et al., 2010).However, most of this previous work has used simple normal prior distributions that do not induce sparsity.Variational methods that use sparsity-inducing priors include Girolami (2001), which uses a Laplace prior on the factors (no prior on the loadings which are treated as free parameters), Hochreiter et al. (2010) which extends this to Laplace priors on both factors and loadings, with fixed values of the Laplace prior parameters; Titsias and Lázaro-Gredilla (2011), which uses a sparse "spike-and-slab" (point normal) prior on the loadings (with the same prior on all K loadings) and a normal prior on the factors; and Hore et al. ( 2016) which uses a spike-and-slab prior on one mode in a tensor decomposition.(While this work was in review further examples appeared, including Argelaguet et al. 2018, which uses normal priors on the loadings and point-normal on the factors.) Our primary contribution here is to develop and implement a more general EB approach to matrix factorization (EBMF).This general approach allows for a wide range of potential sparsity-inducing prior distributions on both the loadings and the factors within a single algorithmic framework.We accomplish this by showing that, when using variational methods, fitting EBMF with any prior family can be reduced to repeatedly solving a much simpler problem-the "empirical Bayes normal means" (EBNM) problem-with the same prior family.This feature makes it easy to implement methods for any desired prior familyone simply has to implement a method to solve the corresponding normal means problem, and then plug this into our algorithm.This approach can work for both parametric families (e.g.normal, point-normal, laplace, point-laplace) and non-parametric families, including the "adaptive shrinkage" priors (unimodal and scale mixtures of normals) from Stephens (2017).It is also possible to accommodate non-negative constraints on either L and/or F by using non-negative prior families.Even simple versions of our approach-e.g. using point-normal priors on both factors and loadings-provide more generality than most existing EBMF approaches and software.
A second contribution of our work is to highlight similarities and differences between EBMF and penalty-based methods for regularizing L and/or F .Indeed, our algorithm for fitting EBMF has the same structure as commonly-used algorithms for penalty-based methods, with the prior distribution playing a role analogous to the penalty (see Remark 3 later).While the general correspondence between estimates from penalized methods and Bayesian posterior modes (MAP estimates) is well known, the connection here is different, because the EBMF approach is estimating a posterior mean, not a mode (indeed, with sparse priors the MAP estimates of L and F are not useful because they are trivially 0).A key difference between the EBMF approach and penalty-based methods is that the EBMF prior is estimated by solving an optimization problem, whereas in penalty-based methods the strength of the penalty is usually chosen by cross-validation.This difference makes it much easier for EBMF to allow for different levels of sparsity in every factor and every loading: in EBMF one simply uses a different prior for every factor and loading, whereas tuning a separate parameter for every factor and loading by CV becomes very cumbersome.
The final contribution is that we provide an R software package, flash (Factors and Loadings by Adaptive SHrinkage), implementing our flexible EBMF framework.We demonstrate the utility of these methods through both numerical comparisons with competing methods and through a scientific application: analysis of data from the GTEx (Genotype Tissue Expression) project on genetic associations across 44 human tissues.In numerical comparisons flash often provides more accurate inferences than other methods, while remaining computationally tractable for moderate-sized matrices (millions of entries).In the GTEx data, flash highlight both effects that are shared across many tissues ("dense" factors) and effects that are specific to a small number of tissues ("sparse" factors).These sparse factors often highlight similarities between tissues that are known to be biologically related, providing external support for the reliability of the results.

A General Empirical Bayes Matrix Factorization Model
We define the K-factor Empirical Bayes Matrix Factorization (EBMF) model as follows: Here Y is the n × p observed data matrix, l k is an n-vector (the kth set of "loadings"), f k is a p-vector (the kth "factor"), G l and G f are pre-specified (possibly non-parametric) families of distributions, g l k and g f k are unknown "prior" distributions that are to be estimated, E is an n × p matrix of independent error terms, and τ is an unknown n × p matrix of precisions τ ij which is assumed to lie in some space T. (This allows structure to be imposed on τ, such as constant precision, τ ij = τ, or column-specific precisions, τ ij = τ j , for example.)Our methods allow that some elements of Y may be "missing", and can estimate the missing values (Section 4.1).
The term "Empirical Bayes" in EBMF means we fit (2.1)-(2.4)by obtaining point estimates for the priors g l k , g f k , k = 1, …, K and approximate the posterior distributions for the parameters l k , f k given those point estimates.This contrasts with a "fully Bayes" approach that, instead of obtaining point estimates for g l k , g f k , would integrate over uncertainty in the estimates.This would involve specifying prior distributions for g l k , g f k as well as (perhaps substantial) additional computation.The EB approach has the advantage of simplicityboth conceptually and computationally-while enjoying many of the benefits of a fully Bayes approach.In particular it allows for sharing of information across elements of each loading/factor.For example, if the data suggest that a particular factor, f k , is sparse, then this will be reflected in a sparse estimate of g f k , and subsequently strong shrinkage of the smaller elements of f k1 , …, f kp towards 0. Conversely, when the data suggest a non-sparse factor then the prior will be dense and the shrinkage less strong.By allowing different prior distributions for each factor and each loading, the model has the flexibility to adapt to any combination of sparse and dense loadings and factors.However, to fully capitalize on this flexibility one needs suitably flexible prior families G l and G f capable of capturing both sparse and dense factors.A key feature of our work is it allows for very flexible prior families, including non-parametric families.
Some specific choices of the distributional families G l and G f correspond to models used in previous work.In particular, many previous papers have studied the case with normal priors, where G l and G f are both the family of zero-mean normal distributions (e.g.Bishop, 1999;Lim and Teh, 2007;Raiko et al., 2007;Nakajima and Sugiyama, 2011).This family is particularly simple, having a single hyper-parameter, the prior variance, to estimate for each factor.However, it does not induce sparsity on either L or F ; indeed, when the matrix Y is fully observed, the estimates of L and F under a normal prior (when using a fully factored variational approximation) are simply scalings of the singular vectors from an SVD of Y (Nakajima and Sugiyama, 2011;Nakajima et al., 2013).Our work here extends these previous approaches to a much wider range of prior families that do induce sparsity on L and/or F .
We note that the EBMF model (2.1)-(2.4)differs in an important way from the sparse factor analysis (SFA) methods in Engelhardt and Stephens (2010), which use a type of Automatic Relevance Determination prior (e.g.Tipping, 2001;Wipf and Nagarajan, 2008) to induce sparsity on the loadings matrix.In particular, SFA estimates a separate hyperparameter for every element of the loadings matrix, with no sharing of information across elements of the same loading.In contrast, EBMF estimates a single shared prior distribution for elements of each loading, which, as noted above, allows for sharing of information across elements of each loading/factor.

Fitting the EBMF Model
To simplify exposition we begin with the case K = 1 ("rank 1"); see Section 3.6 for the extension to general K.To simplify notation we assume the families G l , G f are the same, so we can write G l = G f = G.To further lighten notation in the case K = 1 we use g l , g f , l, f instead of g l 1 , g f 1 , l 1 , f 1 .
Fitting the EBMF model involves estimating all of g l , g f , l, f, τ.A standard EB approach would be to do this in two steps: • Estimate g l , g f and τ, by maximizing the likelihood: L g l , g f , τ ≔ ∬ p Y | l, f, τ g l dl 1 …g l dl n g f df 1 …g f df p (3.1) over g l , g f ∈ G, τ ∈ T. (This optimum will typically not be unique because of identifiability issues; see Section 3.8.)
However, both these two steps are difficult, even for very simple choices of G. Instead, following previous work (see Introduction for citations) we use variational approximations to approximate this approach.Although variational approximations are known to typically under-estimate uncertainty in posterior distributions, our focus here is on obtaining useful point estimates for l, f; results shown later demonstrate that the variational approximation can perform well in this task.

The Variational Approximation
The variational approach-see Blei et al. ( 2017) for review-begins by writing the log of the likelihood (3.1) as: where F q, g l , g f , τ = ∫ q l, f log p Y , l, f | g l , g f , τ q l, f dldf, (3.4) and D KL q p = − ∫ q l, f log p l, f | Y , g l , g f , τ q l, f dldf is the Kullback-Leibler divergence from q to p.This identity holds for any distribution q l, f .Because D KL is non-negative, it follows that F q, g l , g f , τ is a lower bound for the log likelihood: with equality when q l, f = p l, f | Y , g l , g f , τ .
In other words, l g l , g f , τ = max q F q, g l , g f , τ , where the maximization is over all possible distributions q l, f .Maximizing l g l , g f , τ can thus be viewed as maximizing F over q, g l , g f , τ.However, as noted above, this maximization is difficult.The variational approach simplifies the problem by maximizing F but restricting the family of distributions for q.Specifically, the most common variational approach -and the one we consider here-restricts q to the family Q of distributions that "fully-factorize": (3.8) The variational approach seeks to optimize F over q, g l , g f , τ with the constraint q ∈ Q.For q ∈ Q we can write q l, f = q l l q f f where q l l = ∏ i = 1 n q l, i l i and q f f = ∏ j = 1 p q f, j f j , and we can consider the problem as maximizing F q l , q f , g l , g f , τ .

Alternating Optimization
We optimize F q l , q f , g l , g f , τ by alternating between optimizing over variables related to l q l , g l , over variables related to f q f , g f , and over τ.Each of these steps is guaranteed to increase (or, more precisely, not decrease) F , and convergence can be assessed by (for example) stopping when these optimization steps yield a very small increase in F .Note that F may be multi-modal, and there is no guarantee that the algorithm will converge to a global optimum.The approach is summarized in Algorithm 1 The key steps in Algorithm 1 are the maximizations in Steps 4-6.
Step 4, the update of τ, involves computing the expected squared residuals: This is straightforward provided the first and second moments of q l and q f are available (see Appendix A.1 for details).
Steps 5 and 6 are essentially identical except for switching the role of l and f.One of our key results is that each of these steps can be achieved by solving a simpler problem-the Empirical Bayes normal means (EBNM) problem.The next subsection (3.3) describes the EBNM problem, and the following subsection (3.4) details how this can be used to solve Steps 5 and 6.

The EBNM Problem
Suppose we have observations x = x 1 , …, x n of underlying quantities θ = θ 1 , …, θ n , with independent Gaussian errors with known standard deviations s = s 1 , …, s n .Suppose further that the elements of θ are assumed i.i.d.from some distribution, g ∈ G.That is, where N n μ, Σ denotes the n-dimensional normal distribution with mean μ and covariance matrix Σ.
By solving the EBNM problem we mean fitting the model (3.11)-(3.12)by the following two-step procedure: 1.
Estimate g by maximum (marginal) likelihood: g ˆ= arg max g ∈ G j ∫ p x j | θ j , s j g dθ j . (3.13)

2.
Compute the posterior distribution for θ given g ˆ, p θ | x, s, g ˆ∝ j g ˆθj p x j | θ j , s j . (3.14) Later in this paper we will have need for the posterior first and second moments, so we define them here for convenience: Formally, this procedure defines a mapping (which depends on the family ) from the known quantities x, s , to g ˆ, p , where g ˆ, p are given in (3.13) and (3.14).We use EBNM to denote this mapping: (3.17) Remark 1 Solving the EBNM problem is central to all our algorithms, so it is worth some study.A key point is that the EBNM problem provides an attractive and flexible way to induce shrinkage and/or sparsity in estimates of θ.For example, if θ is truly sparse, with many elements at or near 0, then the estimate g ˆ will typically have considerable mass near 0, and the posterior means (3.15) will be "shrunk" strongly toward 0 compared with the original observations.In this sense solving the EBNM problem can be thought of as a model-based analogue of thresholding-based methods, with the advantage that by estimating g from the data the EBNM approach automatically adapts to provide an appropriate level of shrinkage.These ideas have been used in wavelet denoising (Clyde and George, 2000;Johnstone et al., 2004;Johnstone and Silverman, 2005a;Xing et al., 2016), and false discovery rate estimation (Thomas et al., 1985;Stephens, 2017) for example.Here we apply them to matrix factorization problems.

Connecting the EBMF and EBNM Problems
The EBNM problem is well studied, and can be solved reasonably easily for many choices of G (e.g.Johnstone and Silverman, 2005b;Koenker and Mizera, 2014a;Stephens, 2017).
In Section 4 we give specific examples; for now our main point is that if one can solve the EBNM problem for a particular choice of G then it can be used to implement Steps 5 and 6 in Algorithm 1 for the corresponding EBMF problem.The following Proposition formalizes this for Step 5 of Algorithm 1; a similar proposition holds for Step 6 (see also Appendix A).

Proposition 2
Step 5 in Algorithm 1 is solved by solving an EBNM problem.Specifically where the functions l ˆ: R n × p × R p × R p × R n × p R n and s l : R p × R n × p R n are given by For intuition into where the EBNM in Proposition 2 comes from, consider estimating l, g l in (2.1) with f and τ known.The model then becomes n independent regressions of the rows of Y on f, and the maximum likelihood estimate for l has elements: with standard errors Further, it is easy to show that l ˆi N l i , s i 2 . (3.25) Combining (3.25) with the prior l 1 , …, l n iid g l , g l ∈ G (3.26) yields an EBNM problem.
The EBNM in Proposition 2 is the same as the EBNM (3.25)-(3.26),but with the terms f j and f j 2 replaced with their expectations under q f .Thus, the update for q l , g l in Algorithm 1, with q f , g f , τ fixed, is closely connected to solving the EBMF problem for "known f, τ ".

Streamlined Implementation Using First and Second Moments
Although Algorithm 1, as written, optimizes over q l , q f , g l , g f , in practice each step requires only the first and second moments of the distributions q l and q f .For example, the EBNM problem in Proposition 1 involves f − and f 2 − and not g f .Consequently, we can simplify implementation by keeping track of only those moments.In particular, when solving the normal means problem, EBNM x, s in (3.17), we need only return the posterior first and second moments (3.15) and (3.16).This results in a streamlined and intuitive implementation, summarized in Algorithm 2.
Remark 3 Algorithm 2 has a very intuitive form: it has the flavor of an alternating least squares algorithm, which alternates between estimating l given f (Step 6) and f given l (Step 8), but with the addition of the ebnm step (Steps 7 and 9), which can be thought of as regularizing or shrinking the estimates: see Remark 1.This viewpoint highlights connections with related algorithms.For example, the (rank 1 version of the) SSVD algorithm from Yang et al. ( 2014) has a similar form, but uses a thresholding function in place of the ebnm function to induce shrinkage and/or sparsity.

The K-factor EBMF Model
It is straightforward to extend the variational approach to fit the general K factor model (2.1)-(2.4).In brief, we introduce variational distributions q l k , q f k for k = 1, …, K, and then optimize the objective function F q l 1 , g l 1 , q f 1 , g f 1 ; …; q l K , g l K , q f K , g f K ; τ .Similar to the rank-1 model, this optimization can be done by iteratively updating parameters relating to a single loading or factor, keeping other parameters fixed.And again we simplify implementation by keeping track of only the first and second moments of the distributions q l k and q f k , which ) are essentially identical to those for fitting the rank 1 model above, but with Y ij replaced with the residuals obtained by removing the estimated effects of the other k − 1 factors: (3.27) Based on this approach we have implemented two algorithms for fitting the K-factor model.First, a simple "greedy" algorithm, which starts by fitting the rank 1 model, and then adds factors k = 2, …, K, one at a time, optimizing over the new factor parameters before moving on to the next factor.Second, a "backfitting" algorithm (Breiman and Friedman, 1985), which iteratively refines the estimates for each factor given the estimates for the other factors.Both algorithms are detailed in Appendix A.

Selecting K
An interesting feature of EB approaches to matrix factorization, noted by Bishop (1999), is that they automatically select the number of factors K.This is because the maximum likelihood solution to g l k , g f k is sometimes a point mass on 0 (provided G includes this distribution).Furthermore, the same is true of the solution to the variational approximation (see also Bishop, 1999;Stegle et al., 2012).This means that if K is set sufficiently large then some loading/factor combinations will be optimized to be exactly 0. (Or, in the greedy approach, which adds one factor at a time, the algorithm will eventually add a factor that is exactly 0, at which point it terminates.) Here we note that the variational approximation may be expected to result in conservative estimation (i.e.underestimation) of K compared with the (intractable) use of maximum likelihood to estimate g l , g f .We base our argument on the simplest case: comparing K = 1 vs K = 0. Let δ 0 denote the degenerate distribution with all its mass at 0. Note that the rank-1 factor model (2.1), with g l = δ 0 (or g f = δ 0 ) is essentially a "rank-0" model.Now note that the variational lower bound, F , is exactly equal to the log-likelihood when g l = δ 0 (or g f = δ 0 ).This is because if the prior is a point mass at 0 then the posterior is also a point mass, which trivially factorizes as a product of point masses, and so the variational family Q includes the true posterior in this case.Since F is a lower bound to the log-likelihood we have the following simple lemma: Lemma 4 If F q ˆ, g ˆl, g ˆf, τ ˆ> F δ 0 , δ 0 , δ 0 , τ ˆ0 then l g ˆl, g ˆf, τ ˆ> l δ 0 , δ 0 , τ ˆ0 .

■
Thus, if the variational approximation F favors g ˆl, g ˆf, τ ˆ over the rank 0 model, then it is guaranteed that the likelihood would also favor g ˆl, g ˆf, τ ˆ over the rank 0 model.In other words, compared with the likelihood, the variational approximation is conservative in terms of preferring the rank 1 model to the rank 0 model.This conservatism is a double-edged sword.
On the one hand it means that if the variational approximation finds structure it should be taken seriously.On the other hand it means that the variational approximation could miss subtle structure.
In practice Algorithm 2 can converge to a local optimum of F that is not as high as the trivial (rank 0) solution, F δ 0 , δ 0 , δ 0 , τ ˆ0 .We can add a check for this at the end of Algorithm 2, and set g ˆl = g ˆf = δ 0 and τ ˆ= τ ˆ0 when this occurs.

Identifiability
In EBMF each loading and factor is identifiable, at best, only up to a multiplicative constant (provided G is a scale family).Specifically, scaling the prior distributions g f k and g l k by c k and 1/c k respectively results in the same marginal likelihood, and also results in a corresponding scaling of the posterior distribution on the factors f k and loadings l k (e.g. it scales the posterior first moments by c k , 1/c k and the second moments by c k 2 , 1/c k 2 .However, this nonidentifiability is not generally a problem, and if necessary it could be dealt with by re-scaling factor estimates to have norm 1.

Software Implementation: flash
We have implemented Algorithms 2, 4 and 5 in an R package, flash ("factors and loadings via adaptive shrinkage").These algorithms can fit the EBMF model for any choice of distributional family G l , G f : the user must simply provide a function to solve the EBNM problem for these prior families.
One source of functions for solving the EBNM problem is the "adaptive shrinkage" (ashr) package, which implements methods from Stephens (2017).These methods solve the EBNM problem for several flexible choices of G, including: the set of all scale mixtures of zero-centered normals; • G = SU, the set of all symmetric unimodal distributions, with mode at 0; the set of all unimodal distributions, with mode at 0; , the set of all non-negative unimodal distributions, with mode at 0.
These methods are computationally stable and efficient, being based on convex optimization methods (Koenker and Mizera, 2014b) and analytic Bayesian posterior computations.
We have also implemented functions to solve the EBNM problem for additional choices of G in the package ebnm (https://github.com/stephenslab/ebnm).These include G being the "point-normal" family: • G = PN, the set of all distributions that are a mixture of a point mass at zero and a normal with mean 0.
This choice is less flexible than those in ashr, and involves non-convex optimizations, but can be faster.
Although in this paper we focus our examples on sparsity-inducing priors with G l = G f = G we note that our software makes it easy to experiment with different choices, some of which represent novel methodologies.For example, setting G l = U + and G f = SN yields an EB version of semi-non-negative matrix factorization (Ding et al., 2008), and we are aware of no existing EB implementations for this problem.Exploring the relative merits of these many possible options in different types of application will be an interesting direction for future work.

Missing Data
If some elements of Y are missing, then this is easily dealt with.For example, the sums over j in (3.19) and (3.20) are simply computed using only the j for which Y ij is not missing.
This corresponds to an assumption that the missing elements of Y are "missing at random" (Rubin, 1976).In practice we implement this by setting τ ij = 0 whenever Y ij is missing (and filling in the missing entries of Y to an arbitrary number).This allows the implementation to exploit standard fast matrix multiplication routines, which cannot handle missing data.If many data points are missing then it may be helpful to exploit sparse matrix routines.

Initialization
Both Algorithms 2 and 4 require a rank 1 initialization procedure, init.Here, we use the softImpute function from the package softImpute (Mazumder et al., 2010), with penalty parameter λ = 0, which essentially performs SVD when Y is completely observed, but can also deal with missing values in Y .
The backfitting algorithm (Algorithm 5) also requires initialization.One option is to use the greedy algorithm to initialize, which we call "greedy+backfitting".

Numerical Comparisons
We now compare our methods with several competing approaches.To keep these comparisons manageable in scope we focus attention on methods that aim to capture possible sparsity in L and/or F .For EBMF we present results for two different shrinkageoriented prior families, G: the scale mixture of normals G = SN , and the point-normal family G = PN .We denote these flash and flash_pn respectively when we need to distinguish.Finally, for reference we also use standard (truncated) SVD.
All of the Bayesian methods (flash, SFA, SFAmix and NBSFA) are "self-tuning", at least to some extent, and we applied them here with default values.According to Yang et al.
(2014) SSVD is robust to choice of tuning parameters, so we also ran SSVD with its default values, using the robust option (method="method").The softImpute method has a single tuning parameter (λ, which controls the nuclear norm penalty), and we chose this penalty by orthogonal cross-validation (OCV; Appendix B).The PMD method can use two tuning parameters (one for l and one for f) to allow different sparsity levels in l vs f.However, since tuning two parameters can be inconvenient it also has the option to use a single parameter for both l and f.We used OCV to tune parameters in both cases, referring to the methods as PMD.cv2 (2 tuning parameters) and PMD.cv1 (1 tuning parameter).

A Single Factor
Example-We simulated data with n = 200, p = 300 under the single-factor model (2.1) with sparse loadings, and a non-sparse factor: where δ 0 denotes a point mass on 0, and σ 1 2 , …, σ 5 2 ≔ 0.25,0.5,1,2,4 .We simulated using three different levels of sparsity on the loadings, using π 0 = 0.9,0.3,0.(We set the noise precision τ = 1,1/16,1/25 in these three cases to make each problem not too easy and not too hard.) We applied all methods to this rank-1 problem, specifying the true value K = 1.(The NBSFA software does not provide the option to fix K, so is omitted here.)We compare methods in their accuracy in estimating the true low-rank structure B ≔ lf T using relative root mean squared error: (5.3) Despite the simplicity of this simulation, the methods vary greatly in performance (Figure 1).Both versions of flash consistently outperform all the other methods across all scenarios (although softImpute performs similarly in the non-sparse case).The next best performances come from softImpute (SI.cv),PMD.cv2 and SFA, whose relative performances depend on the scenario.All three consistently improve on, or do no worse than, SVD.PMD.cv1 performs similarly to SVD.The SFAmix method performs very variably, sometimes providing very poor estimates, possibly due to poor convergence of the MCMC algorithm (it is the only method here that uses MCMC).The SSVD method consistently performs worse than simple SVD, possibly because it is more adapted to both factors and loadings being sparse (and possibly because, following Yang et al. 2014, we did not use CV to tune its parameters).Inspection of individual results suggests that the poor performance of both SFAmix and SSVD is often due to over-shrinking of non-zero loadings to zero.

A Sparse Bi-cluster
Example (Rank 3)-An important feature of our EBMF methods is that they estimate separate distributions g l , g f for each factor and each loading, allowing them to adapt to any combination of sparsity in the factors and loadings.This flexibility is not easy to achieve in other ways.For example, methods that use CV are generally limited to one or two tuning parameters because of the computational difficulties of searching over a larger space.
We applied flash, softImpute, SSVD and PMD to this example.(We excluded SFA and SFAmix since these methods do not model sparsity in both factors and loadings.)The results (Figure 2) show that again flash consistently outperforms the other methods, and again the next best is softImpute.On this example both SSVD and PMD outperform SVD.Although SSVD and PMD perform similarly on average, their qualitative behavior is different: PMD insufficiently shrink the 0 values, whereas SSVD shrinks the 0 values well but overshrinks some of the signal, essentially removing the smallest of the three loading/ factor combinations (Figure 2 b).

Missing Data Imputation for Real Data Sets
Here we compare methods in their ability to impute missing data using five real data sets.
In each case we "hold out" (mask) some of the data points, and then apply the methods to obtain estimates of the missing values.The data sets are as follows: MovieLens 100 K data, an (incomplete) 943 × 1682 matrix of user-movie ratings (integers from 1 to 5) (Harper and Konstan, 2016).Most users do not rate most movies, so the matrix is sparsely observed (94% missing), and contains about 100K observed ratings.We hold out a fraction of the observed entries and assess accuracy of methods in estimating these.We centered and scaled the ratings for each user before analysis.
GTEx eQTL summary data, a 16 069 × 44 matrix of Z scores computed testing association of genetic variants (rows) with gene expression in different human tissues (columns).These data come from the Genotype Tissue Expression (GTEx) project (Consortium et al., 2015), which assessed the effects of thousands of "eQTLs" across 44 human tissues.(An eQTL is a genetic variant that is associated with expression of a gene.)To identify eQTLs, the project tested for association between expression and every near-by genetic variant, each test yielding a Z score.The data used here are the Z scores for the most significant genetic variant for each gene (the "top" eQTL).See Section 5.3 for more detailed analyses of these data.
Brain Tumor data, a 43 × 356 matrix of gene expression measurements on 4 different types of brain tumor (included in the denoiseR package, Josse et al., 2018).We centered each column before analysis.
Presidential address data, a 13 × 836 matrix of word counts from the inaugural addresses of 13 US presidents  (also included in the denoiseR package, Josse et al., 2018).Since both row and column means vary greatly we centered and scaled both rows and columns before analysis, using the biScale function from softImpute.
Breast cancer data, a 251 × 226 matrix of gene expression measurements from Carvalho et al. ( 2008), which were used as an example in the paper introducing NBSFA (Knowles and Ghahramani, 2011).Following Knowles and Ghahramani (2011) we centered each column (gene) before analysis.
Among the methods considered above, only flash, PMD and softImpute can handle missing data.We add NBSFA (Knowles and Ghahramani, 2011) to these comparisons.To emphasize the importance of parameter tuning we include results for PMD and softImpute with default settings (denoted PMD, SI) as well as using cross-validation (PMD.cv1,SI.cv).
For these real data the appropriate value of K is, of course, unknown.Both flash and NBSFA automatically estimate K.For PMD and softImpute we specified K based on the values inferred by flash and NBSFA.(Specifically, we used K = 10,30,20,10,40 respectively for the five data sets.) We applied each method to all 5 data sets, using 10-fold OCV (Appendix B) to mask data points for imputation, repeated 20 times (with different random number seeds) for each data set.We measure imputation accuracy using root mean squared error (RMSE): (5.10) where Ω is the set of indices of the held-out data points.
The results are shown in Figure 3.Although the ranking of methods varies among data sets, flash, PMD.cv1 and SI.cv perform similarly on average, and consistently outperform NBSFA, which in turn typically outperforms (untuned) PMD and unpenalized softImpute.These results highlight the importance of appropriate tuning for the penalized methods, and also the effectiveness of the EB method in flash to provide automatic tuning.
In these comparisons, as in the simulations, the two flash methods typically performed similarly.The exception is the GTEx data, where the scale mixture of normals G = SN performed worse.Detailed investigation revealed this to be due to a very small number of very large "outlier" imputed values, well outside the range of the observed data, which grossly inflated RMSE.These outliers were so extreme that it should be possible to implement a filter to avoid them.However, we did not do this here as it seems useful to highlight this unexpected behavior.(Note that this occurs only when data are missing, and even then only in one of the five data sets considered here.)

Sharing of Genetic Effects on Gene Expression Among Tissues
To illustrate flash in a scientific application, we applied it to the GTEx data described above, a 16,069 × 44 matrix of Z scores, with Z ij reflecting the strength (and direction) of effect of eQTL i in tissue j.We applied flash with G = SN using the greedy+backfitting algorithm (i.e. the backfitting algorithm, initialized using the greedy algorithm).
The flash results yielded 26 factors (Figure 4-5) which summarize the main patterns of eQTL sharing among tissues (and, conversely, the main patterns of tissue-specificity).For example, the first factor has approximately equal weight for every tissue, and reflects the fact that many eQTLs show similar effects across all 44 tissues.The second factor has strong effects only in the 10 brain tissues, from which we infer that some eQTLs show much stronger effects in brain tissues than other tissues.
Subsequent factors tend to be sparser, and many have a strong effect in only one tissue, capturing "tissue-specific" effects.For example, the 3rd factor shows a strong effect only in whole blood, and captures eQTLs that have much stronger effects in whole blood than other tissues.(Two tissues, "Lung" and "Spleen", show very small effects in this factor but with the same sign as blood.This is intriguing since the lung has recently been found to make blood cells-see Lefrançais et al. 2017-and a key role of the spleen is storing of blood cells.)Similarly Factors 7, 11 and 14 capture effects specific to "Testis", "Thyroid" and "Esophagus Mucosa" respectively.
A few other factors show strong effects in a small number of tissues that are known to be biologically related, providing support that the factors identified are scientifically meaningful.For example, factor 10 captures the two tissues related to the cerebellum, "Brain Cerebellar Hemisphere" and "Brain Cerebellum".Factor 19 captures tissues related to female reproduction, "Ovary", "Uterus" and "Vagina".Factor 5 captures "Muscle Skeletal", with small but concordant effects in the heart tissues ("Heart Atrial Appendage" and "Heart Left Ventricle").Factor 4, captures the two skin tissues ("Skin Not Sun Exposed Suprapubic", "Skin Sun Exposed Lower leg") and also "Esophagus Mucosa", possibly reflecting the sharing of squamous cells that are found in both the surface of the skin, and the lining of the digestive tract.In factor 24, "Colon Transverse" and "Small Intestine Terminal Ileum" show the strongest effects (and with same sign), reflecting some sharing of effects in these intestinal tissues.Among the 26 factors, only a few are difficult to interpret biologically (e.g.factor 8).
To highlight the benefits of sparsity, we contrast the flash results with those for softImpute, which was the best-performing method in the missing data assessments on these data, but which uses a nuclear norm penalty that does not explicitly reward sparse factors or loadings.The first eight softImpute factors are shown in Figure 6.The softImpute results-except for the first two factors-show little resemblance to the flash results, and in our view are harder to interpret.

Computational Demands
It is difficult to make general statements about computational demands of our methods, because both the number of factors and number of iterations per factor can vary considerably depending on the data.However, to give a specific example, running our current implementation of the greedy algorithm on the GTEx data (a 16,000 by 44 matrix) takes about 140s (wall time) for G = PN and 650s for G = SN (on a 2015 MacBook Air with a 2.2 GHz Intel Core i7 processor and 8Gb RAM).By comparison, a single run of softImpute without CV takes 2-3s, so a naive implementation of 5-fold CV with 10 different tuning parameters and 10 different values of K would take over 1000s (although one could improve on this by use of warm starts for example).

Discussion
Here we discuss some potential extensions or modifications of our work.

Orthogonality Constraint
Our formulation here does not require the factors or loadings to be orthogonal.In scientific applications we do not see any particular reason to expect underlying factors to be orthogonal.However, imposing such a constraint could have computational or mathematical advantages.Formally adding such a constraint to our objective function seems tricky, but it would be straightforward to modify our algorithms to include an orthogonalization step each update.This would effectively result in an EB version of the SSVD algorithms in Yang et al. ( 2014), and it seems likely to be computationally faster than our current approach.One disadvantage of this approach is that it is unclear what optimization problem such an algorithm would solve (but the same is true of SSVD, and our algorithms have the advantage that they deal with missing data.)

Non-negative Matrix Factorization
We focused here on the potential for EBMF to induce sparsity on loadings and factors.However, EBMF can also encode other assumptions.For example, to assume the loadings and factors are non-negative, simply restrict G to be a family of non-negative-valued distributions, yielding "Empirical Bayes non-negative Matrix Factorization" (EBNMF).Indeed, the ashr software can already solve the EBNM problem for some such families G, and so flash already implements EBNMF.In preliminary assessments we found that the greedy approach is problematic here: the non-negative constraint makes it harder for later factors to compensate for errors in earlier factors.However, it is straightforward to apply the backfitting algorithm to fit EBNMF, with initialization by any existing NMF method.The performance of this approach is an area for future investigation.

Tensor Factorization
It is also straightforward to extend EBMF to tensor factorization, specifically a CANDE-COMP/PARAFAC decomposition (Kolda and Bader, 2009): E ijm iid N 0,1/τ ijm . (6.5) The variational approach is easily extended to this case (a generalization of methods in Hore et al., 2016), and updates that increase the objective function can be constructed by solving an EBNM problem, similar to EBMF.It seems likely that issues of convergence to local optima, and the need for good initializations, will need some attention to obtain good practical performance.However, results in Hore et al. ( 2016) are promising, and the automatic-tuning feature of EB methods seems particularly attractive here.For example, extending PMD to this case-allowing for different sparsity levels in l, f and ℎ -would require 3 penalty parameters even in the rank 1 case, making it difficult to tune by CV.

Non-Gaussian Errors
It is also possible to extend the variational approximations used here to fit non-Gaussian models, such as binomial data; see for example Jaakkola and Jordan (2000); Seeger and Bouchard (2012); Klami (2015).The extension of our EB methods using these ideas is detailed in Wang (2017).
F q l , q f , g l , g f , τ We optimize F by iteratively updating parameters relating to τ, a single loading k q l k , g l k or factor k q f k , g f k , keeping other parameters fixed.We simplify implementation by keeping track of only the first and second moments of the distributions q l k and q f k , which we denote We now describe each kind of update in turn.

A.1 Updates for Precision Parameters
Here we derive updates to optimize F over the precision parameters τ.Focusing on the parts of F that depend on τ gives: where R 2 − is defined by: If we constrain τ ∈ T then we have (A.9) For example, assuming constant precision τ ij = τ yields: Assuming column-specific precisions τ ij = τ j , which is the default in our software, yields: .11)Other variance structures are considered in Appendix A.5 of Wang (2017).

A.2 Updating Loadings and Factors
The following Proposition, which generalizes Proposition 2 in the main text, shows how updates for loadings (and factors) for the K-factor EBMF model can be achieve by solving an EBNM problem.
Proposition 5 For the K-factor model, arg max q l k , g l k F q l , q f , g l , g f , τ is solved by solving an EBNM problem.Specifically A.12) where the functions l ˆ and s l are given by (3.19) and (3.20), f − k , f 2 − k ∈ R p denote the vectors whose elements are the first and second moments of f k under q f k , and R k denotes the residual matrix (3.27).
Similarly, arg max q f k , g f k F q l , q f , g l , g f , τ is solved by solving an EBNM problem.Specifically, A.13) where the functions f ˆ: R n × p × R n × R n × R n × p R p and s f : R n × R n × p R p are given by

A.2.1 A Lemma on the Normal Means Problem
To prove Proposition 5 we introduce a lemma that characterizes the solution of the normal means problem in terms of an objective that is closely related to the variational objective.
Recall that the EBNM model is: .16)Based on these basic updates we implemented two algorithms for fitting the K-factor EBMF model: the greedy algorithm, and the backfitting algorithm, as follows.

A.3.1 Greedy Algorithm
The greedy algorithm is a forward procedure that, at the kth step, adds new factors and loadings l k , f k by optimizing over q l k , q f k , g l k , g f k while keeping the distributions related to previous factors fixed.Essentially this involves fitting the single-factor model to the residuals obtained by removing previous factors.The procedure stops adding factors when the estimated new factors (or loadings) are identically zero.The algorithm as follows:

A.3.2 Backfitting Algorithm
The backfitting algorithm iteratively refines a fit of K factors and loadings, by updating them one at a time, at each update keeping the other loadings and factors fixed.The name comes from its connection with the backfitting algorithm in Breiman and Friedman (1985), specifically the fact that it involves iteratively re-fitting to residuals.

A.4 Objective Function Computation
The algorithms above all involve updates that will increase (or, at least, not decrease) the objective function F q l , q f , g l , g f , τ .However, these updates do not require computing the objective function itself.In iterative algorithms it can be helpful to compute the objective function to monitor convergence (and as a check on implementation).In this subsection we describe how this can be done.In essence, this involves extending the solver of the EBNM problem to also return the value of the log-likelihood achieved in that problem (which is usually not difficult).
The objective function of the EBMF model is: F q l , q f , g l , g f , τ = E q l , q f log p Y | l, f; τ + E q l log g l l q l l + E q f log g f f q f f (A.41) The calculation of E q l , q f log p Y | l, f; τ is straightforward and E q l log g l l q l l and E q l log g l l q l l can be calculated using the log-likelihood of the EBNM model using the following Lemma 7.
Lemma 7 Suppose g ˆ, q solves the EBNM problem with data x, s : g ˆ, q = EBNM x, s , A.42) where q ≔ q 1 , …, q n are the estimated posterior distributions of the normal means parameters θ 1 , …, θ n .Then E q log ∏ j g ˆθj / ∏ j q j θ j = l g ˆ; x, s + 1 2 ∑ j log 2πs j 2 + 1/s j 2 x j 2 + E q θ j 2 − 2x j E q θ j (A.43) where l g ˆ; x, s is the log of the likelihood for the normal means problem (A.27).
Proof We have from (A.28) F NM q θ , g = ∫ q θ θ log p x, θ | g q θ θ dθ (A.44) within the ebnm function used by the greedy or backfitting algorithms) can be thought of as solving a penalized version of the EBMF problem.

Appendix B.: Orthogonal Cross Validation
Cross-validation assessments involving "holding out" (hiding) data from methods.Here we introduce a novel approach to selecting the data to be held out, which we call Orthogonal Cross Validation (OCV).Although not the main focus of our paper, we believe that OCV is a novel and appealing approach to selecting hold-out data for factor models, e.g. when using CV to select an appropriate dimension K for dimension reduction methods, as in Owen and Wang (2016).
Generic k-fold CV involves randomly dividing the data matrix into k parts and then, for each part, training methods on the other k − 1 parts before assessing error on that part, as in Algorithm 6.
The novel part of OCV is in how to choose the "hold-out" pattern.We randomly divide the columns and rows into k sets.and put these sets into k orthogonal parts, and then take all Y ij with the chosen column and row indices as "hold-out" Y i .
To illustrate this scheme, we take 3-fold CV as an example.We randomly divide the columns into 3 sets and the rows into 3 sets as well.The data matrix Y is divided into 9 partition (by row and column permutation): In OCV, each fold k, Y k contains equally balanced part of data matrix and includes all the row and column indices.This ensures that all i's and where Y 1 = Y 11 , Y 22 , Y 33 , Y 2 = Y 12 , Y 23 , Y 31 and Y 3 = Y 13 , Y 21 , Y 32 .We can see for each "hold-out" part, Y k , L 1 , L 2 , L 3 and F 1 , F 2 , F 3 show up once and only once.In this sense the hold-out pattern is "balanced".Results from simulations with sparse bi-cluster structure K = 3 .Results from running flash on GTEx data (factors 1 -8).The pve ("Percentage Variance Explained") for loading/factor k is defined as pve k ≔ s k / ∑ k s k + ∑ ij 1/τ ij where . It is a measure of the amount of signal in the data captured by loading/ factor k (but its naming as "percentage variance explained" should be considered loose since the factors are not orthogonal).Results from running softImpute on GTEx data (factors 1-8).The factors are both less sparse and less interpretable than the flash results.
p denote the vectors whose elements are the first and second moments of f under q f :

Figure 1 :
Figure 1:Boxplots comparing accuracy of flash with several other methods in a simple rank-1 simulation.This simulation involves a single dense factor, and a loading that varies from strong sparsity (90% zeros, left) to no sparsity (right).Accuracy is measured by difference in each methods RRMSE from the flash RRMSE, with smaller values indicating highest accuracy.The y axis is plotted on a nonlinear (square-root) scale to avoid the plots being dominated by poorer-performing methods.

Figure 3 :
Figure 3: Comparison of the accuracy of different methods in imputing missing data.Each panel shows a boxplot of error rates (RMSE) for 20 simulations based on masking observed entries in a real data set.

Figure 5 :
Figure 5: Results from running flash on GTEx data (factors 15 -26) (Mazumder et al., 2010) Sparse Factor Analysis (SFA)(Engelhardt and  Stephens, 2010), SFAmix (Gao et al., 2013), Nonparametric Bayesian Sparse Factor Analysis (NBSFA)(Knowles and Ghahramani, 2011), Penalized Matrix Decomposition  (Witten et al., 2009)(PMD, implemented in the R package PMA), and Sparse SVD(Yang et al., 2014)(SSVD, implemented in R package ssvd).Although the methods we compare against involve only a small fraction of the very large number of methods for this problem, the methods were chosen to represent a wide range of different approaches to inducing sparsity: SFA, SFAmix and NBSFA are three Bayesian approaches with quite different approaches to prior specification; PMD is based on a penalized likelihood with L 1 penalty on factors and/or loadings; and SSVD is based on iterative thresholding of singular vectors.We also compare with softImpute(Mazumder et al., 2010), which does not explicitly model sparsity in L and F , but fits a regularized low-rank matrix using a nuclear-norm penalty.
Then Y 1 = Y 11 , Y 22 , Y 33 , Y 2 = Y 12 , Y 23 , Y 31 and Y 3 = Y 13 , Y 21 , Y 32 are orthogonal to each other.Then the data matrix Y is marked as: j's are included into each Y − k .In 3-fold OCV, we have: